Rene Perez.
The following is a sample of products created during the “Data Visualization and Reproducible Research” course.
In this project, I explored the Billboard Summer Hits and the
Student Performance data sets. This file focuses on Student
Performance: I wanted to find the relationship between the
exam score and other features, such as time spent on Netflix and
study hours, so I explored the data set by creating plots
and fitting a multiple linear regression. Please find the code and
report in the project_01/ folder.
Sample data visualization:
| | Dependent variable: |
| | exam_score |
| age | -0.012 |
| | (0.074) |
| genderMale | 0.146 |
| | (0.348) |
| genderOther | 0.793 |
| | (0.866) |
| study_hours_per_day | 9.575*** |
| | (0.116) |
| social_media_hours | -2.602*** |
| | (0.145) |
| netflix_hours | -2.282*** |
| | (0.158) |
| part_time_jobYes | 0.211 |
| | (0.414) |
| attendance_percentage | 0.143*** |
| | (0.018) |
| sleep_hours | 1.992*** |
| | (0.138) |
| diet_qualityGood | -0.683* |
| | (0.378) |
| diet_qualityPoor | -0.272 |
| | (0.473) |
| exercise_frequency | 1.450*** |
| | (0.084) |
| parental_education_levelHigh School | -0.160 |
| | (0.396) |
| parental_education_levelMaster | -0.411 |
| | (0.508) |
| parental_education_levelNone | -0.702 |
| | (0.633) |
| internet_qualityGood | -0.473 |
| | (0.373) |
| internet_qualityPoor | -0.082 |
| | (0.503) |
| mental_health_rating | 1.944*** |
| | (0.060) |
| extracurricular_participationYes | -0.014 |
| | (0.364) |
| Constant | 7.177*** |
| | (2.503) |
| Observations | 1,000 |
| R² | 0.902 |
| Adjusted R² | 0.900 |
| Residual Std. Error | 5.342 (df = 980) |
| F Statistic | 473.908*** (df = 19; 980) |
| Note: | * p<0.1; ** p<0.05; *** p<0.01 (standard errors in parentheses) |
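This is a very strong fit: the model explains about 90% of the variance in exam scores (R² = 0.902). The most important predictors of exam performance, all significant at the 1% level, are: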
📚 Study hours (positive)
📱 Social media and 📺 Netflix use (negative)
🛏️ Sleep (positive)
💪 Exercise and 😊 Mental health (positive)
🏫 Attendance (positive)
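To make the link between the table and this list concrete, here is a minimal sketch in Python (not the R code from the project folder) that recomputes t-statistics and significance stars from a few estimate and standard-error pairs copied from the exam_score table. It assumes the usual star convention (* p<0.1, ** p<0.05, *** p<0.01) and uses a normal approximation to the t distribution, which should be reasonable with 980 residual degrees of freedom. The function names are made up for illustration.

```python
import math

# (estimate, standard error) pairs copied from the exam_score table.
COEFS = {
    "study_hours_per_day": (9.575, 0.116),
    "social_media_hours": (-2.602, 0.145),
    "netflix_hours": (-2.282, 0.158),
    "diet_qualityGood": (-0.683, 0.378),
    "age": (-0.012, 0.074),
}


def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation.

    Assumption of this sketch: with 980 residual degrees of freedom the
    t distribution is close enough to the normal for star assignment.
    """
    return math.erfc(abs(t_stat) / math.sqrt(2))


def stars(p):
    """Map a p-value to significance stars (assumed thresholds 0.01/0.05/0.1)."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


def summarize(coefs):
    """Return {name: (t statistic, p-value, stars)} for each coefficient."""
    rows = {}
    for name, (estimate, std_error) in coefs.items():
        t_stat = estimate / std_error
        p = two_sided_p(t_stat)
        rows[name] = (t_stat, p, stars(p))
    return rows


if __name__ == "__main__":
    for name, (t_stat, p, mark) in summarize(COEFS).items():
        print(f"{name:22s} t = {t_stat:8.2f}  p = {p:.4f} {mark}")
```

The estimate for study hours is more than 80 standard errors away from zero, which is why it stands out, while age is well within one standard error of zero.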
In this project, I explored the California Housing data set
to find the relationship between house prices and proximity to
the ocean. I explored the data set by creating plots and fitting a
multiple linear regression. Please find the code and report
in the project_02/ folder.
California Housing:
| | Dependent variable: |
| | median_house_value |
| longitude | -26,812.990*** |
| | (1,019.651) |
| latitude | -25,482.190*** |
| | (1,004.702) |
| housing_median_age | 1,072.520*** |
| | (43.886) |
| total_rooms | -6.193*** |
| | (0.791) |
| total_bedrooms | 100.556*** |
| | (6.869) |
| population | -37.969*** |
| | (1.076) |
| households | 49.617*** |
| | (7.451) |
| median_income | 39,259.570*** |
| | (338.005) |
| ocean_proximityINLAND | -39,284.300*** |
| | (1,744.258) |
| ocean_proximityISLAND | 152,901.900*** |
| | (30,741.880) |
| ocean_proximityNEAR.BAY | -3,954.052** |
| | (1,913.339) |
| ocean_proximityNEAR.OCEAN | 4,278.134*** |
| | (1,569.525) |
| Constant | -2,269,954.000*** |
| | (88,013.880) |
| Observations | 20,433 |
| R² | 0.646 |
| Adjusted R² | 0.646 |
| Residual Std. Error | 68,656.950 (df = 20420) |
| F Statistic | 3,111.608*** (df = 12; 20420) |
| Note: | * p<0.1; ** p<0.05; *** p<0.01 (standard errors in parentheses) |
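The summary rows of both regression tables can be cross-checked from R², the number of observations, and the number of predictors. The following is a minimal Python sketch of the standard adjusted R² and overall F formulas, not the code used in the project; the helper names are hypothetical, and the inputs are the rounded values printed in the tables.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² for n observations and k predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic implied by R², with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# R², observations, and predictor counts as reported in the two tables.
MODELS = {
    "exam_score": {"r2": 0.902, "n": 1000, "k": 19},
    "median_house_value": {"r2": 0.646, "n": 20433, "k": 12},
}

if __name__ == "__main__":
    for name, m in MODELS.items():
        adj = adjusted_r2(m["r2"], m["n"], m["k"])
        f = f_statistic(m["r2"], m["n"], m["k"])
        print(f"{name:20s} adjusted R² = {adj:.3f}  F = {f:.1f}")
```

The recomputed adjusted R² values round to the reported 0.900 and 0.646. The recomputed F statistics (roughly 475 and 3,105) differ slightly from the reported ones because R² is only available here to three decimals.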
The model fits reasonably well (R² ≈ 0.65).
All predictors are statistically significant at the 5% level, and all but ocean_proximityNEAR.BAY at the 1% level.
median_income is the strongest positive predictor.
Location features (longitude, latitude, ocean_proximity) are very important.
Population and housing structure (rooms, bedrooms, households) affect value, but their individual coefficients may be distorted by multicollinearity [1].
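One common way to check for multicollinearity is the variance inflation factor (VIF). The sketch below is a simplified Python illustration, not part of the project code: it covers only the two-predictor case, where the VIF reduces to 1 / (1 - r²) with r the Pearson correlation between the two predictors. The toy values are invented for illustration and are not taken from the California Housing data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Invented toy values (NOT from the real data set): households and bedrooms
# move almost in lockstep, while housing age is only loosely related.
toy_households = [120, 340, 560, 410, 980, 150, 720, 260]
toy_bedrooms = [130, 365, 590, 450, 1010, 170, 760, 270]
toy_age = [41, 12, 30, 52, 8, 25, 19, 36]

if __name__ == "__main__":
    print("households vs bedrooms:", vif_two_predictors(toy_households, toy_bedrooms))
    print("households vs age:     ", vif_two_predictors(toy_households, toy_age))
```

A VIF near 1 suggests little overlap between predictors. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict test. With more than two predictors, each VIF comes from regressing one predictor on all the others.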
In this project, I explored different visualizations using geom_point(), geom_density(), and other ggplot2 geoms. The sample shown here is a Tampa weather plot.
Sample data visualization:
Summary of Insights:
Hot months (Jun–Aug) are warm and often wet, especially June and July.
Cold months (Dec–Feb) have lower average temperatures and relatively less rainfall.
Transitional months (Mar–May, Sep–Nov) show mixed weather, with both dry and wet days.
Next steps:
Keep exploring ggplot2, sf, and other visualization packages.
Work on visualizations for machine learning models.
Keep working on mapping visualizations for spatial data.
[1] Multicollinearity happens when two or more predictor variables in a regression model are highly correlated with each other. This means they contain overlapping information, which makes it hard for the model to determine which variable is actually influencing the outcome.